1.  INTRODUCTION


In classification, many features or attributes often make the design of a classifier difficult and degrade its performance. This is particularly pronounced when the number of examples is small relative to the number of features, a consequence of the curse of dimensionality: in simple terms, the number of examples required to properly compute a classifier grows exponentially with the number of features. For example, assuming features are correlated, approximating a binary distribution in an $n$-dimensional feature space requires estimating $O(2^n)$ unknown variables [1]. In such situations, the problem often becomes intractable. This calls for reducing the number of features in constructing classifiers.

There are many dimensionality reduction techniques in the literature. The two most popular ones are principal components analysis (PCA) and linear discriminant analysis (LDA) [2]. Both techniques have been successfully applied to a wide variety of practical problems. By projecting data onto a linear subspace spanned by principal components, PCA achieves dimension reduction with the minimal data reconstruction error. On the other hand, because it does not take class information into account, PCA cannot compute the discriminant information required by classifiers. In this paper, we are concerned with LDA.

In LDA, we are given a set of $l$ examples:

$$z = \{(x_i, y_i)\}_{i=1}^{l}. \tag{1}$$

These examples are independently and identically distributed (i.i.d.) from the probability space $Z = X \times Y$. Here the probability measure $\rho$ is defined but unknown, $x_i \in X \subset \mathbb{R}^q$ are the $q$-dimensional inputs, and $y_i \in Y = [-M, M] \subset \mathbb{R}$ are scalar labels. According to Fisher's criterion, one has to find a projection matrix $W \in \mathbb{R}^{q \times d}$ that maximizes:


$$J(W) = \frac{|W^T S_b W|}{|W^T S_w W|}, \tag{2}$$


where  Sb  and  Sw  are  so-called  between-class  and  within- 
class  matrices,  and  d  denotes  the  dimensions  of  the  reduced 
space.  In  practice,  the  "small  sample  size"  (SSS)  problem  is 
often  encountered,  when  l  <  q  .  In  this  case  Sw  is  singular. 
Therefore,  the  maximization  problem  can  be  difficult  to  solve. 

To address this issue, the term $\epsilon I$ is added to $S_w$, where $\epsilon$ is a small positive number and $I$ the identity matrix of proper size. This results in maximizing

$$J(W) = \frac{|W^T S_b W|}{|W^T (S_w + \epsilon I) W|}. \tag{3}$$

It  can  then  be  solved  without  any  numerical  problems. 
This  is  a  special  case  of  Friedman's  regularized  discriminant 
analysis  with  regard  to  the  small  sample  size  problem  [3].  In 
[4],  it  is  shown  that  naive  Bayes  outperforms  LDA  under 
broad  conditions.  In  this  work,  we  address  this  problem  in 
the  context  of  dimensionality  reduction. 
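The regularized criterion (3) can be illustrated with a short NumPy sketch. It is not taken from the paper: the function names, the unnormalized scatter convention, and the value of `eps` are assumptions made for illustration. The sketch reduces the generalized eigenvalue problem $S_b w = \mu (S_w + \epsilon I) w$ to a symmetric one through a Cholesky factor, which is possible because $S_w + \epsilon I$ is positive definite even when $S_w$ is singular.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class and between-class scatter of the rows of X (unnormalized convention)."""
    m0 = X.mean(axis=0)
    q = X.shape[1]
    S_w, S_b = np.zeros((q, q)), np.zeros((q, q))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        S_b += len(Xc) * np.outer(mc - m0, mc - m0)
    return S_w, S_b

def regularized_lda(X, y, d, eps=1e-2):
    """Columns of the result solve S_b w = mu (S_w + eps I) w for the d largest mu."""
    S_w, S_b = scatter_matrices(X, y)
    A = S_w + eps * np.eye(S_w.shape[0])            # positive definite even if S_w is singular
    L_inv = np.linalg.inv(np.linalg.cholesky(A))    # A = L L^T
    vals, vecs = np.linalg.eigh(L_inv @ S_b @ L_inv.T)  # symmetric problem, ascending eigenvalues
    return L_inv.T @ vecs[:, ::-1][:, :d]           # map back: w = L^{-T} v

# Small sample size setting: l = 10 examples, q = 50 features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 5)
X = rng.normal(size=(10, 50)) + y[:, None]
S_w, S_b = scatter_matrices(X, y)
W = regularized_lda(X, y, d=1)
```

In this toy setting $S_w$ has rank at most 8, so the unregularized problem is ill posed, while the regularized one returns a finite direction.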

In  this  paper,  we  present  a  margin  based  criterion  for 
dimensionality  reduction  that  potentially  provides  a  solution 
to  the  problem  implied  by  the  above  discussion.  In 
particular,  we  show  that 

•  our margin based criterion for dimensionality reduction is closely related to the average margin criterion [5];

•  our objective does not involve the inverse of $S_w$ and can be optimized using algorithms such as semi-definite programming, thereby avoiding the small sample size problem; and

•  we establish an error bound for the proposed technique.

We demonstrate the efficacy of our proposed technique using a variety of examples. We note that this work significantly extends, in terms of both theoretical analysis and experimental evaluation, our earlier work that appeared in the proceedings of the IEEE International Conference on Data Mining 2007 [6].

The  rest  of  the  paper  is  organized  as  follows.  Section  2 
provides  a  discussion  on  related  work  in  discriminant 
analysis  and  subspace  techniques.  Section  3  introduces  our 
proposal  on  discriminant  analysis  that  derives  linear 
discriminants  in  two  class  problems  by  optimizing  a 
weighted  additive  criterion.  Section  4  shows  a  procedure  that 
optimizes  our  criterion  and  its  correctness.  Section  5 
establishes  an  error  bound  for  the  proposed  technique  by 
showing  its  relationship  to  regularized  least  squares.  Section 
6  demonstrates  how  to  extend  our  criterion  to  the  multi-class 
case.  Section  7  presents  experimental  evaluation  of  the 
proposed  technique  against  several  competing  techniques 
using a variety of real data sets. Finally, Section 8 summarizes our work and points out future research directions.

2.  RELATED  WORK 

A number of proposals have been introduced to address the computational difficulty associated with LDA when the small sample size problem occurs ($S_w$ becomes singular). A straightforward method is to use the pseudo-inverse $S_w^{+}$ in place of $S_w^{-1}$. While simple, the method does not guarantee that Fisher's objective will be optimized by the eigenvector matrix of $S_w^{+} S_b$. Furthermore, computing $S_w^{+}$ itself is ill posed. Another simple method is to first use PCA to remove the null space of $S_w$, and then apply LDA to the reduced representation. Fisherface is one such example [7]. However, this method remains sub-optimal because the null space of $S_w$ potentially contains discriminant information [8].

Another technique, newLDA [8], first transforms the data into the null space of $S_w$. It then applies PCA to maximize the between-class scatter in the transformed space. While newLDA mitigates the small sample size problem to the extent possible, its performance degrades with decreasing dimensions of the null space. A variant of PCA+LDA is proposed in [9]. The method first discards the null space of $S_w + S_b$, which is the common null space of both $S_w$ and $S_b$. As such, discarding this null space does not lose any discriminant information. The method then applies PCA+LDA to the reduced representation in the transformed space. Direct LDA (DLDA) is a method that throws away the null space of $S_b$ [10]. If $S_w + S_b$ replaces $S_w$, DLDA reduces to PCA+LDA [10].

Discriminant  analysis  based  on  the  average  margin  is 
proposed  in  [5].  The  technique  does  not  involve  inverting 
matrices,  thereby  avoiding  the  small  sample  size  problem. 
This  technique  is  closely  related  to  our  proposal,  as  we  shall 
see  later. 

Recently, a dimension reduction technique called linear feature extraction (LFE) was introduced in [11]. Let $x$ be an instance. We define the near hit, $nh(x)$, as the nearest neighbor of $x$ that comes from the same class as $x$. Similarly, we define the near miss, $nm(x)$, as the nearest neighbor of $x$ that comes from the opposite class. Then the hypothesis margin of $x$ with respect to labeled data $L$ is defined as [12]

$$\sigma(x) = \|x - nm(x)\| - \|x - nh(x)\|. \tag{4}$$

The  hypothesis  margin  is  easy  to  compute  and  lower 
bounds  the  sample  margin  [12]. 
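As a small illustration of Eq. (4), the following NumPy sketch computes the hypothesis margin of every example from pairwise Euclidean distances. The function name and the toy data are assumptions made for illustration only.

```python
import numpy as np

def hypothesis_margins(X, y):
    """Distance to the near miss minus distance to the near hit, for every row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(D, np.inf)                # a point is not its own neighbor
    same = y[:, None] == y[None, :]
    near_hit = np.where(same, D, np.inf).min(axis=1)    # nearest same-class point
    near_miss = np.where(same, np.inf, D).min(axis=1)   # nearest other-class point
    return near_miss - near_hit

# Four points on a line: class 0 at 0 and 1, class 1 at 3 and 4.
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
margins = hypothesis_margins(X, y)   # [2., 1., 1., 2.]
```

The two points closest to the class boundary (at 1 and 3) have the smallest margins, as expected.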

Let $h(x) = x - nh(x)$ and $m(x) = x - nm(x)$. We define two matrices, near hit $S_h$ and near miss $S_m$, as follows:

$$S_h = \sum_{i=1}^{l} h(x_i)\,h(x_i)^t \tag{5}$$

and

$$S_m = \sum_{i=1}^{l} m(x_i)\,m(x_i)^t. \tag{6}$$

Instead of optimizing the margin (4) by selecting features, a technique described in [11] computes a linear transform that optimizes the following

$$\max_{x}\; x^t (S_m - S_h)\,x \quad \text{subject to } (x^t x)^2 = 1, \tag{7}$$

which is very similar to our criterion (15). To be less sensitive to noise, $k$ near misses and hits are often used in practice to optimize the margin for some integer $k$ [11].

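When the near hit and the near miss of $x$ are replaced by the mean of its own class and the mean of the opposite class, respectively, and each class contains $l/2$ examples, we can rewrite $S_h$ as follows, where $m_{+1}$ and $m_{-1}$ denote the two class means:

$$
\begin{aligned}
S_h &= \sum_{x \in \{+1\}} (x - m_{+1})(x - m_{+1})^t + \sum_{x \in \{-1\}} (x - m_{-1})(x - m_{-1})^t \\
&= \sum_{x \in \{+1,-1\}} x x^t - 2\sum_{x \in \{+1\}} x\, m_{+1}^t + \sum_{x \in \{+1\}} m_{+1} m_{+1}^t - 2\sum_{x \in \{-1\}} x\, m_{-1}^t + \sum_{x \in \{-1\}} m_{-1} m_{-1}^t \\
&= \sum_{x \in \{+1,-1\}} x x^t - \frac{l}{2}\, m_{+1} m_{+1}^t - \frac{l}{2}\, m_{-1} m_{-1}^t.
\end{aligned} \tag{8}
$$

Similarly, $S_m$ can be written as

$$
\begin{aligned}
S_m &= \sum_{x \in \{+1\}} (x - m_{-1})(x - m_{-1})^t + \sum_{x \in \{-1\}} (x - m_{+1})(x - m_{+1})^t \\
&= \sum_{x \in \{+1,-1\}} x x^t - l\, m_{+1} m_{-1}^t - l\, m_{-1} m_{+1}^t + \frac{l}{2}\, m_{+1} m_{+1}^t + \frac{l}{2}\, m_{-1} m_{-1}^t.
\end{aligned} \tag{9}
$$

Then

$$
\begin{aligned}
S_m - S_h &= l\,\big(m_{+1} m_{+1}^t - m_{+1} m_{-1}^t - m_{-1} m_{+1}^t + m_{-1} m_{-1}^t\big) \\
&= l\,(m_{+1} - m_{-1})(m_{+1} - m_{-1})^t = l\, S_b.
\end{aligned} \tag{10}
$$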
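The identity in Eq. (10) is easy to check numerically. The sketch below is an illustration under stated assumptions: two classes of equal size $l/2$, class means used in place of near hits and near misses, and $S_b$ taken as the outer product of the mean difference. The helper name `scatter_about` is hypothetical.

```python
import numpy as np

def scatter_about(X, center):
    """Sum of outer products (x - center)(x - center)^t over the rows of X."""
    D = X - center
    return D.T @ D

rng = np.random.default_rng(0)
half, q = 20, 5                          # l/2 examples per class, q features
X_pos = rng.normal(loc=1.0, size=(half, q))
X_neg = rng.normal(loc=-1.0, size=(half, q))
l = 2 * half
m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

# Near hit -> own class mean, near miss -> opposite class mean.
S_h = scatter_about(X_pos, m_pos) + scatter_about(X_neg, m_neg)
S_m = scatter_about(X_pos, m_neg) + scatter_about(X_neg, m_pos)
S_b = np.outer(m_pos - m_neg, m_pos - m_neg)

gap = np.abs((S_m - S_h) - l * S_b).max()   # should vanish up to rounding
```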

This shows that when near hits and near misses are extended to the entire neighborhood, maximizing the margin reduces to maximizing the between-class scatter matrix. Because it ignores the within-class scatter matrix, it cannot be optimal. This lends theoretical support to the practical observation that the average neighborhood for near hit and near miss in Relief should be somewhere between 1 and $l/2$ [11].

Margin Based Dimensionality Reduction and Generalization

The Open Artificial Intelligence Journal, 2010, Volume 4 57

A metric space dimension reduction technique, called discriminant neighborhood embedding (DNE), is introduced in [13]. The idea is to find a linear transform such that in the transformed space total within class distance is minimized, while total between class distance is maximized. Let $x_j \in NB_w(x_i)$ if $x_j$ is a within class neighbor of $x_i$, and $x_j \in NB_b(x_i)$ if $x_j$ is a between class neighbor of $x_i$. The neighborhood can be computed using $k$ nearest neighbors. Then this objective is accomplished by defining an adjacency matrix $F$, where

$$F_{ij} = \begin{cases} 1 & \text{if } x_i \in NB_w(x_j) \text{ or } x_j \in NB_w(x_i); \\ -1 & \text{if } x_i \in NB_b(x_j) \text{ or } x_j \in NB_b(x_i); \\ 0 & \text{otherwise.} \end{cases}$$

The objective is to find $P$ such that

$$\operatorname{tr}\big(P^t X (S - F) X^t P\big)$$

is minimized, subject to $P^t P = I$. Here $X$ is the data matrix, and $S$ is a diagonal matrix, where $S_{ii} = \sum_j F_{ij}$.

If we use $k = 1$ to compute $NB_w$ and $NB_b$, we can write $X(S - F)$ as

$$M = \big(mh(x_1) \;\cdots\; mh(x_l)\big),$$

where $mh(x_i) = nm(x_i) - nh(x_i)$ represents the difference between the near miss $nm(x_i)$ and the near hit $nh(x_i)$ of $x_i$ [12]. In this case, the objective becomes maximizing $\operatorname{tr}(P^t M X^t P)$ over $P$, subject to $P^t P = I$. This is in many ways similar to the idea presented in [14], where we compute a subspace by pooling local information $(nm(x_i) - nh(x_i))\,x_i^t$, for $i = 1, \cdots, l$. Here the local information is the cross covariance of an instance and the difference between its near miss and near hit. If we compute $NB_w$ and $NB_b$ over the entire classes, i.e., $k = l/2$ (assuming each class has the same number of examples), it can be shown that

$$X(S - F)X^t = S_b.$$

The result is similar to the LFE result Eq. (10). In practice, $k$ is chosen somewhere between 1 and $l/2$.

Several techniques have recently been proposed to improve the Fisher criterion in heteroscedastic data [15, 16]. These techniques employ the Chernoff distance to capture differences in variance between within class matrices. Like the Fisher criterion, they involve computing the inverse of within class matrices and thus potentially suffer from the small sample size problem. In this work, we are mainly interested in addressing the small sample size problem facing LDA.

3.  A MARGIN CRITERION FOR DISCRIMINANT ANALYSIS

In this section, we first review LDA using Fisher's criterion, and then go on to investigate discriminant analysis using a margin based criterion and related optimization techniques.

3.1.  Linear  Discriminant  Analysis 

In LDA, within-class, between-class, and mixture scatter matrices are used to formulate the criteria of class separability. Consider a $J$ class problem, where $m_0$ is the mean vector of all data, and $m_j$ is the mean vector of the $j$th class data. A within-class scatter matrix characterizes the scatter of samples around their respective class mean vectors, and it is expressed by

$$S_w = \sum_{j=1}^{J} p_j \sum_{i=1}^{l_j} (x_i^j - m_j)(x_i^j - m_j)^T, \tag{11}$$

where $x_i^j$ is the $i$th example of the $j$th class, $l_j$ is the size of the data in the $j$th class, and $p_j$ ($\sum_j p_j = 1$) represents the proportion of the $j$th class contribution. A between-class scatter matrix characterizes the scatter of the class means around the mixture mean $m_0$. It is expressed by

$$S_b = \sum_{j=1}^{J} p_j (m_j - m_0)(m_j - m_0)^T. \tag{12}$$

The mixture scatter matrix is the covariance matrix of all samples, regardless of their class assignment, and it is given by

$$S_m = \sum_{i=1}^{l} (x_i - m_0)(x_i - m_0)^T = S_w + S_b. \tag{13}$$

The Fisher criterion is used to find the projection matrix that maximizes the objective (2). In order to determine the matrix $W$ that maximizes $J(W)$, one can solve the generalized eigenvalue problem $S_b w_i = \lambda_i S_w w_i$. The eigenvectors corresponding to the largest eigenvalues form the columns of $W$. For a two class problem, it can be written in a simpler form: $S_w w = m = m_1 - m_2$, where $m_1$ and $m_2$ are the means of the two classes.
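For the two class case, the simpler form $S_w w = m_1 - m_2$ can be sketched directly in NumPy. The code below is illustrative only: it assumes a non-singular $S_w$ (more examples than features), uses the unnormalized within-class scatter, and the function name is hypothetical. It also checks that the resulting $w$ satisfies the generalized eigenvalue equation with $S_b = (m_1 - m_2)(m_1 - m_2)^T$.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Solve S_w w = m1 - m2 for two classes given as row-wise sample matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S_w, m1 - m2), S_w

rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(40, 3))
X2 = rng.normal(loc=2.0, size=(40, 3))
w, S_w = fisher_direction(X1, X2)

delta = X1.mean(axis=0) - X2.mean(axis=0)
S_b = np.outer(delta, delta)       # two-class between-class scatter (up to a constant)
lam = delta @ w                    # eigenvalue paired with w: S_b w = lam * S_w w
```

Because $S_b$ has rank one here, $w$ is the only generalized eigenvector with a non-zero eigenvalue.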

3.2.  Margin  Criterion  for  Linear  Dimensionality 
Reduction 

Here we first focus on two class problems. The multi-class case will be discussed later. The goal of LDA is to find a direction $w$ that simultaneously places two classes afar and minimizes within class variations. Fisher's criterion (2) achieves this goal. Alternatively, we can achieve this goal by maximizing

$$J(w) = \operatorname{tr}\big(w^t (\lambda S_b - S_w)\, w\big), \tag{14}$$

where tr denotes the trace operator, and $\lambda > 0$ is a constant that weighs the relative importance of the two terms $S_b$ and $S_w$ in determining the outcome of linear discriminants. Large $\lambda$ values ignore within class spread, while small $\lambda$ values penalize discriminants that result in large within class variations.

Notice that $\operatorname{tr}(S_b)$ measures the overall scatter of class means. Therefore, a large $\operatorname{tr}(S_b)$ implies that the class means spread out in a transformed space. On the other hand, a small $\operatorname{tr}(S_w)$ indicates that in the transformed space the spread of each class is small. Thus, when maximized, $J$ indicates that data points are close to each other within a class, while they are far from each other if they come from different classes.


To see that our proposal Eq. (14) is margin based, notice that maximizing $\operatorname{tr}(S_b - S_w)$ is equivalent to maximizing

$$J = \frac{1}{2} \sum_i \sum_j p_i\, p_j\, d(C_i, C_j),$$

where $p_i$ denotes the probability of class $C_i$. The interclass distance $d$ is defined as $d(C_i, C_j) = d(m_i, m_j) - \operatorname{tr}(S_i) - \operatorname{tr}(S_j)$, where $m_i$ represents the mean of class $C_i$, and $S_i$ represents the scatter matrix of class $C_i$. As noted in [5], $d(C_i, C_j)$ measures the average margin between two classes. Therefore, maximizing our objective produces large margin linear discriminants. In addition, there is no need to calculate the inverse of $S_w$, thereby avoiding the small sample size problem associated with the Fisher criterion.


4.  COMPUTING  LINEAR  DISCRIMINANTS  WITH 
SEMI-DEFINITE  PROGRAMMING 


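Suppose that $w$ optimizes (14). So does $cw$ for any constant $c \neq 0$. Thus we require that $w$ have unit length. The optimization problem then becomes

$$\max_{w}\; \operatorname{tr}\big(w^t (\lambda S_b - S_w)\, w\big) \quad \text{subject to: } \|w\| = 1.$$

This is a constrained optimization problem. Since

$$\operatorname{tr}\big(w^t (\lambda S_b - S_w) w\big) = \operatorname{tr}\big((\lambda S_b - S_w)\, w w^t\big) = \operatorname{tr}\big((\lambda S_b - S_w) X\big),$$

where $X = w w^t$, we can rewrite the above constrained optimization problem as

$$
\begin{aligned}
\max_{X}\;& \operatorname{tr}\big((\lambda S_b - S_w) X\big) \\
\text{subject to: }& I \bullet X = 1, \\
& X \succeq 0,
\end{aligned} \tag{15}
$$

where $I$ is the identity matrix, the inner product of symmetric matrices is $A \bullet B = \sum_{i,j} a_{ij} b_{ij}$, and $X \succeq 0$ means that the symmetric matrix $X$ is positive semi-definite. Indeed, if $X$ is a solution to the above optimization problem, then $X \succeq 0$ and $I \bullet X = 1$ imply $\|w\| = 1$, assuming $\operatorname{rank}(X) = 1$.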
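As a sanity check on formulation (15), one can use a standard fact that is not specific to this paper: the maximum of a linear objective $\operatorname{tr}(AX)$ over positive semi-definite matrices with unit trace equals the largest eigenvalue of $A$, attained at $X = w w^t$ for a unit top eigenvector $w$. The NumPy sketch below uses this fact in place of an SDP solver, so it is not the solver-based procedure described in the text. The data, the value of $\lambda$, and the function name are assumptions.

```python
import numpy as np

def margin_direction(S_b, S_w, lam=1.0):
    """Unit top eigenvector and top eigenvalue of lam * S_b - S_w."""
    vals, vecs = np.linalg.eigh(lam * S_b - S_w)   # eigenvalues in ascending order
    return vecs[:, -1], vals[-1]

rng = np.random.default_rng(2)
X1 = rng.normal(loc=0.0, size=(6, 20))     # 12 examples, 20 features: S_w is singular
X2 = rng.normal(loc=1.0, size=(6, 20))
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_b = np.outer(m1 - m2, m1 - m2)
lam = 1.0

w, top = margin_direction(S_b, S_w, lam)
X_opt = np.outer(w, w)                     # candidate solution of (15)
objective = np.trace((lam * S_b - S_w) @ X_opt)

# Any other feasible point (PSD, unit trace) should not score higher.
B = rng.normal(size=(20, 20))
X_rand = B @ B.T
X_rand /= np.trace(X_rand)
objective_rand = np.trace((lam * S_b - S_w) @ X_rand)
```

No inverse of $S_w$ is needed at any point, which is the property the margin criterion relies on in the small sample size setting.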

The above problem is a semi-definite program (SDP), where the objective is linear with linear matrix inequality and affine equality constraints. Because linear matrix inequality constraints are convex, SDPs are convex optimization problems. The significance of SDP is due to several factors. SDP is an elegant generalization of linear programming, and inherits its duality theory. For a comprehensive overview on SDP, see [17].

SDPs arise in many applications, including sparse PCA, learning kernel matrices, Euclidean embedding, and others. In general, generic methods are rarely used for solving SDPs, because their time grows at the rate of $O(n^3)$ and their memory grows in $O(n^2)$, where $n$ is the number of rows (or columns) of a semidefinite matrix. When $n$ is greater than a few thousand, SDPs are typically not used. However, there are algorithms that have a good theoretical foundation to solve SDPs [17]. In addition, semidefinite programming is a very useful technique for solving many problems. For example, SDP relaxations can be applied to clustering problems such that after solving an SDP, final clusters can be computed by projecting the data onto the space spanned by the first few eigenvectors of the SDP solution. For large-scale problems, there is a tremendous opportunity for exploiting special structures in problems, such as those suggested in [18, 19].

Assume $\operatorname{rank}(X) = 1$. Since $X$ is symmetric, one can show that $\operatorname{rank}(X) = 1$ iff $X = w w^t$ for some vector $w$. Therefore, we can recover $w$ from $X$ as follows. Select any column (say the $i$th column) of $X$ such that $X(1, i) \neq 0$, and let

$$w = X(:, i) / X(1, i), \tag{16}$$

where $X(:, i)$ denotes the $i$th column of the matrix $X$; this recovers $w$ up to a scale factor. Thus, our goal here is to ensure the solution $X$ to the above constrained optimization problem has rank at most 1.
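The recovery step of Eq. (16) can be sketched as follows. The function name, the tolerance, and the choice of the column with the largest entry in the first row are assumptions made for numerical convenience. Since $X(:, i) = w_i\, w$ and $X(1, i) = w_1 w_i$, the quotient equals $w / w_1$, so the result is rescaled to unit length afterwards.

```python
import numpy as np

def recover_w(X, tol=1e-12):
    """Recover w (up to scale) from a rank-one matrix X = w w^t via Eq. (16)."""
    i = int(np.argmax(np.abs(X[0, :])))      # column whose first-row entry is largest
    if abs(X[0, i]) <= tol:
        raise ValueError("first row of X is numerically zero; Eq. (16) does not apply")
    return X[:, i] / X[0, i]

w_true = np.array([0.6, 0.8, 0.0])
X = np.outer(w_true, w_true)
w_hat = recover_w(X)                         # equals w_true / w_true[0]
w_unit = w_hat / np.linalg.norm(w_hat)       # rescale to unit length
```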

One way to guarantee $\operatorname{rank}(X) = 1$ is to use $\operatorname{rank}(X) = 1$ as an additional constraint in the optimization problem. However, the constraint $\operatorname{rank}(X) = 1$ is not convex and the resulting problem is difficult to solve. It turns out that the above formulation (15) is sufficient to ensure that the rank of the optimal solution $X$ to Eq. (15) is one, i.e., $\operatorname{rank}(X) = 1$.

Theorem 1 Let $X$ be the solution to the semi-definite program (15). Also, let $\operatorname{rank}(X) = r$. Then $r = \operatorname{rank}(X) = 1$.

Proof. Since $S_m = S_b + S_w$, we can rewrite $\lambda S_b - S_w = (\lambda + 1) S_b - S_m$. Let $\operatorname{null}(A)$ denote the null space of matrix $A$. Since $\operatorname{null}(S_m) \subseteq \operatorname{null}(S_b)$, there exists a matrix $P \in \mathbb{R}^{q \times s}$ that simultaneously diagonalizes $S_b$ and $S_m$ [20], where $s \le \min\{l - 1, q\}$ is the rank of $S_m$.

The matrix $P$ is given by

$$P = Q \Lambda_m^{-1/2} U,$$

where $\Lambda_m$ and $Q$ are the eigenvalue and eigenvector matrices of $S_m$, and $U$ is the eigenvector matrix of $\Lambda_m^{-1/2} Q^t S_b Q \Lambda_m^{-1/2}$. Thus, the columns of $P$ are the eigenvectors of $(\lambda + 1) S_b - S_m$ and the corresponding eigenvalues are $(\lambda + 1) \Lambda_b - I$. We then have

$$P^t S_b P = \Lambda_b, \qquad P^t S_m P = I, \tag{17}$$

where $\Lambda_b = \operatorname{diag}\{\sigma_1, \cdots, \sigma_s\}$.

Consider the range of $P$ over $Y \in \mathbb{R}^{s \times q}$ with $\operatorname{rank}(Y) = s$. The range $W = PY$ includes all $q \times q$ matrices with rank $s$. Then

$$\max_{W} \operatorname{tr}\big(W^t ((\lambda + 1) S_b - S_m) W\big) = \max_{Y} \operatorname{tr}\big((PY)^t ((\lambda + 1) S_b - S_m) PY\big) = \max_{Y} \operatorname{tr}\big(Y^t ((\lambda + 1) \Lambda_b - I) Y\big).$$

It is straightforward to show that the maximum is attained by $Y = [e_1\, e_2 \cdots e_r;\, 0]$, where $e_i$ is a vector whose $i$th component is one and the rest are 0. From this it is clear that $W = PY$ consists of the first $r$ columns of $P$, i.e., the eigenvectors corresponding to $(\lambda + 1)\sigma_i - 1 > 0$.

Now, since $X = W W^t$, we have $X = \sum_{i=1}^{r} w_i w_i^t$. Thus,

$$\operatorname{tr}(X) = \sum_{i=1}^{r} w_i^t w_i = r.$$

However, the constraint $I \bullet X = 1$ states that $\operatorname{tr}(X) = 1$. It follows that $r = 1$. That is, $\operatorname{rank}(X) = 1$. Therefore, our procedure for computing $w$ from the matrix $X$ Eq. (16) is guaranteed to produce the correct answer. We call our algorithm SDP-LDA.

While the criterion Eq. (14) is different from the Fisher criterion Eq. (2), it is very competitive¹. Here we use the Iris data to illustrate this. In this example, all three classes of the Iris data are used, where each class has 50 examples. We randomly choose 60% as training and the remaining 40% as testing. Since $S_w$ is non-singular, the small sample size problem does not occur. Since there are three classes, we use two linear discriminants computed according to the Fisher criterion to project the data. For the proposed criterion $\lambda S_b - S_w$ Eq. (14) we use one discriminant to represent the data in the reduced space. The one nearest neighbor rule is used to predict the class label in the reduced space. The average accuracies over 200 runs are 0.9634 (Eq. (14)) and 0.9510 (Fisher), respectively. This example shows that Eq. (14) is indeed competitive against the Fisher criterion, even with fewer resources. We will see very similar results later in the experimental section.

¹ In [24], it is incorrectly shown that Eq. (14) is equivalent to the Fisher criterion Eq. (2).

5.  ERROR BOUND

We establish an error bound for our learning algorithm in two steps. First, we show the relationship between our learning algorithm and regularized LDA Eq. (3). Second, we show that maximizing Eq. (3) produces the same solution as minimizing the regularized least squares

$$f = \arg\min_{f \in H} \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2 + \gamma \|f\|_H^2, \tag{18}$$

when the hypothesis space $H$ is linear, $H_L = \{ f \mid f(x) = w^t x \}$. Here $\gamma$ represents the regularization constant. In this case, we have $\|f\|_H^2 = w^t w$, and thus

$$w_{opt} = \arg\min_{w} \frac{1}{l} \sum_{i=1}^{l} (y_i - w^t x_i)^2 + \gamma\, w^t w. \tag{19}$$

We then use the bound for $w_{opt}$ to bound our learning algorithm. The following lemma establishes a relationship between regularized LDA (3) and our proposal (15).

Lemma 2 Let regularized LDA be defined by (3). Then solving Eq. (15) is a special case of solving regularized LDA in two class problems.

Proof. Rewrite (15) as

$$J(w) = \frac{1}{2} \operatorname{tr}\big(w^t (\lambda S_b - S_w) w\big) + \alpha \big(1 - \operatorname{tr}(w w^t)\big), \tag{20}$$

where $\alpha$ is the Lagrange multiplier. We then have, absorbing a constant factor into $\alpha$,

$$\frac{\partial J}{\partial w} = (\lambda S_b - S_w) w - \alpha w.$$

Setting the above to zero, we obtain

$$(S_w + \alpha I) w = \lambda S_b w. \tag{21}$$

Solving the above is equivalent to solving

$$(S_w + \alpha I) w = m_1 - m_2, \tag{22}$$

since $S_b w$ is always in the direction $m_1 - m_2$ (the scale factor for $w$ has no consequence), where $m_i$ ($i = 1, 2$) represents the mean of class $i$. Thus, setting $\epsilon$ in Eq. (3) to $\alpha$ results in linear solutions to regularized LDA (3) in two class problems.
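The stationarity condition (21) and its reduction to (22) can be checked numerically. The sketch below is illustrative and rests on assumptions: the stationary point is taken to be the unit top eigenvector of $\lambda S_b - S_w$, with the corresponding eigenvalue playing the role of $\alpha$, and $S_b$ is taken as $(m_1 - m_2)(m_1 - m_2)^t$. The data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(loc=0.0, size=(10, 30))    # 20 examples, 30 features: S_w is singular
X2 = rng.normal(loc=1.0, size=(10, 30))
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_b = np.outer(m1 - m2, m1 - m2)
lam = 1.0

# Stationary point of (20): an eigenpair of lam * S_b - S_w.
vals, vecs = np.linalg.eigh(lam * S_b - S_w)
alpha, w = vals[-1], vecs[:, -1]

lhs = (S_w + alpha * np.eye(30)) @ w       # left side of Eq. (21)
rhs = lam * (S_b @ w)                      # right side of Eq. (21)

# Eq. (22): regularized LDA direction with epsilon set to alpha.
w_reg = np.linalg.solve(S_w + alpha * np.eye(30), m1 - m2)
cosine = abs(w @ w_reg) / (np.linalg.norm(w) * np.linalg.norm(w_reg))
```

A cosine of one (up to rounding) indicates that the two directions coincide, which is what the lemma states for two class problems.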

The  above  lemma  states  that  our  proposal  (15)  and 
regularized  LDA  (3)  produce  the  same  linear  discriminants 
in  two  class  problems.  We  now  show  in  the  following  lemma 
that  the  solution  to  (3)  is  equivalent  to  the  solution  to  (19). 

Lemma 3 The linear solution to the regularized Fisher's criterion (3) is equivalent to the linear solution $w_{opt}$ to the least squares criterion (19), up to a constant, in two class problems.

The proof of the Lemma is given in Appendix A. Combining the above two lemmas, we conclude that

Theorem 4 The linear solution to Eq. (15) is equivalent to the linear solution $w_{opt}$ to the least squares criterion (19), up to a constant, in two class problems.

Recall that LDA tries to find a transformation $f$ such that the transformed data has a large between class difference and small within class variation. The equivalence between Eq. (19) and Eq. (3) can be stated in another way: one minimizes the mean squared error with respect to each class mean while keeping the mean difference fixed.
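For reference, the regularized least squares criterion (19) has a closed-form minimizer: setting the gradient to zero gives $(X^t X + l \gamma I)\, w = X^t y$, where the rows of $X$ are the examples. The NumPy sketch below illustrates this; the function names, the synthetic data, and the value of $\gamma$ are assumptions, and no intercept term is modeled.

```python
import numpy as np

def regularized_least_squares(X, y, gamma):
    """Minimizer of (1/l) * sum (y_i - w^t x_i)^2 + gamma * w^t w."""
    l, q = X.shape
    return np.linalg.solve(X.T @ X + l * gamma * np.eye(q), X.T @ y)

def objective(w, X, y, gamma):
    return np.mean((y - X @ w) ** 2) + gamma * (w @ w)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = np.where(X[:, 0] > 0, 1.0, -1.0)       # two class labels coded as +1 / -1
gamma = 0.1

w_opt = regularized_least_squares(X, y, gamma)
l = X.shape[0]
grad = -2.0 / l * X.T @ (y - X @ w_opt) + 2.0 * gamma * w_opt   # gradient of (19) at w_opt
```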

We define $f_\rho$ as the best function that minimizes the mean squared error. That is

$$f_\rho = \arg\min_{f} \int_Z (y - f(x))^2\, d\rho. \tag{23}$$

Given the training data $z$ Eq. (1), the goal of learning is to find a function $f_z$ that comes as close as possible to $f_\rho$:

$$f_{opt} = \arg\min_{f_z} \int_Z (f_z - f_\rho)^2\, d\rho.$$

Here $\rho$ is unknown and so is $f_\rho$. Instead, we consider empirical error minimization. However, solving empirical error minimization often leads to overfitting and the solution is unstable if no constraint is placed on $f_z$. Thus, we minimize the regularized empirical error Eq. (19).

The equivalence between our proposal and (19) allows us to establish an error bound for (15) in the following theorem, due to [21, 22]. First, let

$$L_k f(x) = \int_X f(x')\, k(x, x')\, d\rho_X, \tag{24}$$

where $\rho_X$ is the marginal probability measure on $X$ and $k(x, x')$ is a continuous, symmetric, positive semi-definite function on $X \times X$. Throughout the paper, we assume that there exists a positive constant $M$ that satisfies

$$|f(x) - y| \le M \tag{25}$$

almost everywhere.

Theorem 5 Let $z$ Eq. (1) be randomly drawn according to $\rho$, and $f_\rho$ be defined by Eq. (23). Then for any $0 < \delta < 1$, with confidence $1 - \delta$ the error bound for the solution $w_{opt}$ of Eq. (19), thus Eq. (15), is given by:

$$\int_X (w_{opt} - f_\rho)^2\, d\rho_X \le S(\gamma) + A(\gamma), \tag{26}$$

where $A(\gamma)$ (approximation error in this context) and $S(\gamma)$ (sample error) are given by

$$A(\gamma) = \gamma^{1/2}\, \|L_k^{-1/4} f_\rho\|^2$$

and

$$S(\gamma) = \frac{32 M^2 (\gamma + C_k)^2}{\gamma^2}\, v^*(m, \delta),$$

where $v^*(m, \delta)$ is the unique positive solution of

$$\frac{m}{4} v^3 - \ln\Big(\frac{4m}{\delta}\Big) v - c_v = 0. \tag{27}$$

Here $m$ denotes the number of training examples, and $C_k, c_v > 0$ depend only on $X$ and $k$.

The decomposition $S(\gamma) + A(\gamma)$ represents the bias-variance tradeoff. Given a hypothesis space, $A(\gamma)$ measures the "error" between the optimal function learnable from the hypothesis space and the true target $f_\rho$. $S(\gamma)$, on the other hand, bounds the sample error and is essentially derived by the law of large numbers.

We can bound $S(\gamma) + A(\gamma)$ by [23] (Corollary 5)

$$O\big((\ldots / m)^{\ldots}\big)$$

by taking

$$\gamma = \big(\log(4/\delta)\, \ldots / m\big)^{\ldots}. \tag{28}$$

This is accomplished primarily through the employment of a concentration inequality, while still following the essential outline of the approach described in [21]. The error bound is not only sharper, but also provides a guide to the asymptotic value of the regularization parameter $\gamma$ Eq. (28). It shows that for a fixed $\delta$, $\gamma$ goes to zero as the number of training examples goes to infinity, as expected. For a given training data set, high confidence (low $\delta$) requires large $\gamma$. Notice that there is no conflict between $\gamma$ defined in [21] and the one in [23]. $\gamma$ in [21] is optimal within the settings discussed in the paper.

6.  MULTI-CLASS LDA

We have presented a margin based criterion as an alternative to Fisher's criterion. We have shown how to optimize our criterion with semi-definite programming to obtain the optimal linear transform in two class problems, where one dimensional projection is adequate. However, LDA is generally used to find a subspace with $d$ dimensions for multiple class problems. In this section we extend our SDP approach to LDA to the multi-class case.

We start with $A_1 = \lambda S_b - S_w$, where $S_b$ and $S_w$ are computed as in the two class case. We solve the problem in (15)

$$
\begin{aligned}
\max_{X}\;& \operatorname{tr}(A_1 X) \\
\text{subject to: }& X \succeq 0, \\
& I \bullet X = 1
\end{aligned}
$$

to obtain the solution $X_1 = w_1 w_1^t$. Once we have obtained the solution $X_j = w_j w_j^t$, we inflate $A_j$ to obtain

$$A_{j+1} = A_j + X_j,$$

from which we compute $X_{j+1} = w_{j+1} w_{j+1}^t$ for $j = 1, \cdots, C - 1$, where $C$ represents the number of classes. Here the $X_j$'s force $w_{j+1}$ to be orthogonal to the $w_j$'s, as desired. To see this, we write

$$
\begin{aligned}
w_{j+1}^t (A_{j+1})\, w_{j+1} &= w_{j+1}^t \big(A_1 + w_1 w_1^t + \cdots + w_j w_j^t\big) w_{j+1} \\
&= w_{j+1}^t A_1 w_{j+1} + (w_{j+1}^t w_1)^2 + \cdots + (w_{j+1}^t w_j)^2.
\end{aligned}
$$

Thus, $w_{j+1}^t (A_{j+1})\, w_{j+1}$ is minimized when $w_{j+1}$ minimizes the $A_1$ term and is orthogonal to the $w_j$'s, since $w_j \neq 0$ for all $j$. Its complexity is at most $C$ times the complexity for two class problems.

It can also be shown that the solution obtained as such is the same as the solution obtained by treating the multi-class problem as $C$ binary problems, where the $i$th two class problem treats the $i$th class as one class and all remaining classes as the second class. Each binary class problem is solved first, and after finding all subspaces, PCA is applied to find eigenvectors having the largest eigenvalues, which are the solution of the original multi-class LDA problem.

7.  EXPERIMENTS 

In  this  section  we  compare  the  proposed  technique  with 
several  competing  subspace  techniques  using  a  number  of 
examples,  including  multi-class  facial  and  binary  data  sets. 

7.1.  Competing  Methods 

The  following  subspace  techniques  will  be  evaluated.  All 
procedural  parameters  are  determined  through  10-fold  cross 
validation. 

SDP-LDA:  Our  proposed  algorithm  (15).  To  solve  the 
semi-definite  program,  we  used  the  general  purpose 
optimization  software  SeDuMi  [24]. 

PCA+LDA: Apply PCA to remove the null space of $S_w$ first, then maximize [7, 25]

$$J(W) = |W^T S_b W| \,/\, |W^T S_w W|.$$

S-LDA: Same as PCA+LDA but maximizing [2, 26]:

$$J(W) = |W^T S_b W| \,/\, |W^T S_m W|.$$

newLDA:  If  Sw  is  full  rank  then  solve  regular  LDA;  else 
in  the  null  space  of  Sw  ,  find  the  eigenvectors  of  Sb  with 
largest  eigenvalues  [8]. 

DLDA:  Apply  PCA  to  remove  the  null  space  of  Sb  first, 
then  find  the  eigenvectors  of  Sw  corresponding  to  the 
smallest  eigenvalues  [10]. 

DNE:  The  discriminant  neighborhood  embedding 
algorithm  presented  in  [13].  It  finds  a  linear  transform  that 
maps  within  class  samples  closer  together  and  between  class 
data  samples  away  from  each  other. 

LFE: The linear feature extraction algorithm proposed in
[11]. Similar to Relief, it finds a linear subspace by
maximizing the hypothesis margin. Because LFE is a
two-class subspace technique, it is applied only to the
two-class problems (Section 7.3).

C-LDA:  The  linear  dimensionality  reduction  algorithm 
using  the  Chernoff  criterion  [15].  This  method  is  applied  to 
the  two  class  data  experiments  only  (Section  7.3),  since  it 
has  some  difficulty  in  computing  the  inverse  of  within  class 
matrices  on  the  two  image  data  sets. 

It should be noted that PCA+LDA and S-LDA are
equivalent when Sw and Sm span the same subspace.
However, they differ when Sb totally or partially spans
the null space of Sw, so that Sw and Sm span different
subspaces. For face recognition the latter case turns out to be
more common. In [8], Chen et al. show that the null space of
Sw contains discriminant information. They also show that
S-LDA (Scatter-LDA) is not "optimal" in that it fails to
distinguish the most discriminant information in the null
space of Sw. Thus they propose the newLDA method. However,
newLDA falls short of making use of any information outside
of that null space.

In all the experiments, the data in a reduced space are
normalized to have zero mean and unit variance along each
dimension (i.e., Gaussian normalization). First, the mean and
variance along each dimension are calculated using the
training data. Then, the training mean and variance are used
to normalize the test data.
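
The normalization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy data, the use of the population variance, and the guard for constant dimensions are all assumptions made for the example.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Return per-dimension (means, stds) estimated from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Assumption: a constant dimension (std 0) is left unscaled.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical two-dimensional reduced data.
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
test = [[3.0, 10.0]]

means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)  # uses the training mean and variance
```

Fitting the statistics on the training data alone keeps information about the test set out of the preprocessing step.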

7.2.  Facial  Images 

7.2.1.  Feret  Face  Data 

The  FERET  face  data  [27]  is  now  a  standard  facial 
database  for  testing  and  evaluating  facial  classification 
algorithms.  Each  image  has  384  x  256  pixels.  Sample 
images  used  in  the  experiments  are  shown  in  Fig.  (1).  The 
images  used  here  involve  variations  in  facial  expressions  and 
illumination. 

For  the  FERET  data,  we  extracted  150  images,  where 
there  are  50  individuals  with  three  images  from  each.  We 
randomly  choose  two  images  per  person  for  training,  and  the 
remaining  one  for  testing.  Thus,  for  the  FERET  data  we  have 
100  training  and  50  test  images.  Each  image  is  preprocessed 
to  align  along  the  eyes  and  reduced  in  size  to  150x130 
pixels.  The  corresponding  preprocessed  images  are  shown  in 
Fig.  (2).  The  preprocessed  images  of  150x130  pixels  are 
first  transformed  into  a  space  of  149  dimensions  spanned  by 
the  150  images  through  PCA.  As  a  result,  we  are  facing  the 
challenge  of  the  small  sample  size  problem. 


Fig.  (1).  Feret  sample  images. 

Fig.  (2).  Normalized  Feret  sample  images. 

Subspaces  are  calculated  from  the  training  data,  and  the 
one  nearest  neighbor  (NN)  classifier  is  used  to  obtain 
accuracy  after  projecting  the  data  onto  the  subspace.  We 
prefer  a  simple  classifier  in  order  to  highlight  the  subspace 
methods. To obtain average performance, each method is
repeated 10 times. The average accuracy as a function of
dimensionality  is  shown  in  Fig.  (3). 
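
The evaluation loop described above (project onto the first d directions, classify with 1NN, record accuracy per d) might look like the sketch below. The basis W, the toy points, and all function names are hypothetical; in the experiments W would come from one of the subspace methods.

```python
import math


def project(rows, basis):
    """Coordinates of each row along the given basis vectors."""
    return [[sum(a * b for a, b in zip(row, w)) for w in basis] for row in rows]


def nn_predict(train_x, train_y, query):
    """Label of the training point nearest to `query` (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def nn_accuracy(train_x, train_y, test_x, test_y):
    hits = sum(nn_predict(train_x, train_y, q) == y for q, y in zip(test_x, test_y))
    return hits / len(test_y)


def accuracy_by_dimension(basis, train_x, train_y, test_x, test_y):
    """1NN accuracy using the first d basis vectors, for d = 1..len(basis)."""
    curve = []
    for d in range(1, len(basis) + 1):
        sub = basis[:d]
        curve.append(nn_accuracy(project(train_x, sub), train_y,
                                 project(test_x, sub), test_y))
    return curve


# Hypothetical data: the first coordinate separates the two classes.
W = [[1.0, 0.0], [0.0, 1.0]]
train_x = [[0.0, 5.0], [1.0, -5.0], [10.0, 5.0], [11.0, -5.0]]
train_y = [0, 0, 1, 1]
test_x = [[2.0, 100.0], [9.0, -100.0]]
test_y = [0, 1]
curve = accuracy_by_dimension(W, train_x, train_y, test_x, test_y)
```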

The X-axis represents the dimensionality of the
subspace. For each technique, the dimensions are ordered so
that each additional dimension is less discriminant than the
preceding ones. For most techniques, the accuracy rate
increases quickly over the first 10 dimensions, and then
increases slowly with additional dimensions.

SDP-LDA is uniformly better than all other algorithms
on the Feret data, especially at lower dimensions,
demonstrating its efficacy. It achieves the highest accuracy
rate of 0.922 on the Feret data. newLDA performs quite well
in the experiment, demonstrating that the most discriminant
information is in the null space of Sw for the facial
recognition tasks. On the other hand, S-LDA does not
perform well in lower dimensional subspaces, but it
eventually performs better than PCA+LDA when the
number of dimensions is large enough. All methods achieve
higher accuracy rates toward higher dimensional subspaces,
which is not surprising, for it is a 50 class problem. It should
be noted that the performance of newLDA and S-LDA (its
tail is not shown in the plot) drops quickly with unnecessary
dimensions.

Fig. (3). Comparison of SDP-LDA, DLDA, LDA, newLDA,
S-LDA, and DNE on the Feret image data.

7.2.2.  ORL  Face  Data 

The ORL data set [28] is used in this experiment. The
size of each image is 92x112. We extracted 120 images,
where there are 40 subjects with three images from each.
Sample images are shown in Fig. (4). Similar to the Feret
data, the images of 92x112 pixels are first transformed into
a space of 119 dimensions spanned by the 120 images
through PCA.


Fig.  (4).  ORL  sample  images. 

We  randomly  choose  two  images  per  person  for  training, 
and  the  remaining  one  for  testing.  We  have  80  training  and 
40  test  images.  Again,  the  one  nearest  neighbor  classifier  is 
used  to  obtain  accuracy  rates  after  projecting  the  data  onto 
the  subspace.  To  obtain  average  performance,  each  method 
is  repeated  10  times.  The  average  accuracy  as  a  function  of 
dimensionality is shown in Fig. (5).

SDP-LDA is again uniformly better than all other
algorithms on the ORL data. It achieves the highest accuracy
rate of 0.8875 on ORL. For most techniques, the accuracy
rate increases quickly over the first 10 dimensions, and
then increases slowly with additional dimensions. The results
are similar to what we observe on the Feret data.

Fig. (5). Comparison of SDP-LDA, DLDA, LDA, newLDA,
S-LDA, and DNE on the ORL image data.

7.3. Binary Data Sets

In these experiments, we compare the eight competing
methods on a number of two-class classification problems.
We use 12 data sets from the UCI database and the cat and
dog data (CatDog). The cat and dog data set is composed of
two hundred images of cat and dog faces, with equal
numbers of cats and dogs. Each image is a
black-and-white 64 x 64 pixel image, and the images have
been registered by aligning the eyes. PCA is first applied to
reduce the number of dimensions from 4096 to 199.

For each data set, we randomly choose 60% as training
and the remaining 40% as testing. We train the eight
methods on the training data and obtain a one-dimensional
subspace. We then project both training and test data onto the
chosen subspace and use the 1NN classifier to obtain error
rates. Note that for the two-class case, a one-dimensional
subspace is sufficient. Again, all procedural parameters for
all the methods are chosen through 10-fold cross-validation.
We repeat the experiments 30 times on each data set to
obtain the average accuracy rates.
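
The repeated random-split protocol can be sketched as below. This is only an outline under stated assumptions: `average_accuracy`, `random_split`, the fixed seed, and the nearest-mean stand-in classifier are illustrative names and choices, and the 10-fold cross-validation used for parameter selection is omitted.

```python
import random


def random_split(n, train_fraction, rng):
    """Shuffle indices 0..n-1 and cut them into train and test parts."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_fraction * n)
    return idx[:cut], idx[cut:]


def average_accuracy(xs, ys, evaluate, runs=30, train_fraction=0.6, seed=0):
    """Mean of evaluate(train_x, train_y, test_x, test_y) over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        tr, te = random_split(len(xs), train_fraction, rng)
        scores.append(evaluate([xs[i] for i in tr], [ys[i] for i in tr],
                               [xs[i] for i in te], [ys[i] for i in te]))
    return sum(scores) / len(scores)


def nearest_mean_accuracy(train_x, train_y, test_x, test_y):
    """Stand-in evaluator on 1-D data: assign each point to the closest class mean."""
    means = {}
    for label in set(train_y):
        vals = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(vals) / len(vals)
    preds = [min(means, key=lambda c: abs(x - means[c])) for x in test_x]
    return sum(p == y for p, y in zip(preds, test_y)) / len(test_y)


# Hypothetical, well separated one-dimensional two-class data.
xs = [float(v) for v in range(10)] + [100.0 + v for v in range(10)]
ys = [0] * 10 + [1] * 10
avg = average_accuracy(xs, ys, nearest_mean_accuracy)
```

Any subspace method followed by 1NN could be passed as `evaluate` in place of the stand-in.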

The average accuracy rates over 30 runs are shown in
Table 1. Both SDP-LDA and S-LDA (Scatter-LDA) come
out first five times out of the 13 data sets, while each of the
C-LDA (Chernoff LDA), LFE and DNE methods comes out
first once. The differences are significant only on three data
sets: New Thyroid, Sonar, and CatDog (paired t-test with a
95% confidence level). Overall, SDP-LDA, S-LDA and C-LDA
perform very similarly on the binary data sets.
Another way to look at these methods is to see how they
perform across tasks. That is, we want to see how robustly
a method performs when a given task does not favor
this particular method. The following can be used to
measure robustness. For each method m we compute the
ratio b_m between its error rate e_m and the smallest error rate
over all methods being compared in a particular example:

$$b_m = \frac{e_m}{\min_k e_k},$$

where k ranges over all methods being compared.
One can see that the values of b_m for the most robust method
are centered around one with zero spread.
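
As an illustration, the ratio can be computed directly from the accuracy rates reported in Table 1. The sketch below assumes that the error rate is one minus the reported accuracy, which the text does not state explicitly; the variable and function names are hypothetical.

```python
from statistics import median

METHODS = ["DLDA", "PCA+LDA", "newLDA", "SDP-LDA", "S-LDA", "LFE", "DNE", "C-LDA"]

# Accuracy rates copied from Table 1, one row per data set.
ACCURACY = {
    "BreastCancer": [0.6541, 0.6445, 0.6445, 0.6536, 0.6659, 0.6386, 0.6327, 0.6609],
    "Credit":       [0.7969, 0.7488, 0.5952, 0.8062, 0.8162, 0.8088, 0.8004, 0.8013],
    "Heart Cleve":  [0.7551, 0.7530, 0.7530, 0.7644, 0.7780, 0.7581, 0.7466, 0.7695],
    "Heart Hun":    [0.7607, 0.7440, 0.7077, 0.7726, 0.7679, 0.7526, 0.7641, 0.7628],
    "Ionosphere":   [0.7604, 0.7514, 0.7514, 0.8421, 0.8282, 0.7379, 0.7268, 0.8364],
    "Letters":      [0.9482, 0.9379, 0.9379, 0.9692, 0.9683, 0.9530, 0.9521, 0.9686],
    "Thyroid":      [0.8203, 0.7936, 0.7936, 0.8221, 0.8233, 0.9634, 0.9529, 0.8267],
    "Pima":         [0.6564, 0.6831, 0.6831, 0.6824, 0.6897, 0.6728, 0.6661, 0.6883],
    "Glass":        [0.9165, 0.8588, 0.8588, 0.9059, 0.9182, 0.9082, 0.9035, 0.9153],
    "Iris":         [0.8700, 0.9362, 0.9362, 0.9437, 0.9437, 0.8912, 0.8988, 0.9450],
    "Cancer Wis":   [0.9596, 0.9559, 0.9559, 0.9577, 0.9551, 0.9557, 0.9605, 0.9550],
    "Sonar":        [0.6774, 0.5146, 0.5030, 0.7122, 0.6896, 0.6744, 0.6659, 0.6848],
    "CatDog":       [0.6816, 0.7152, 0.5816, 0.8025, 0.7741, 0.6962, 0.6968, 0.7741],
}


def robustness_ratios(accuracy_row):
    """b_m = e_m / min_k e_k, with e = 1 - accuracy (an assumption)."""
    errors = [1.0 - a for a in accuracy_row]
    best = min(errors)
    return [e / best for e in errors]


ratios = {name: robustness_ratios(row) for name, row in ACCURACY.items()}

# Number of data sets on which each method has the highest accuracy.
wins = {m: 0 for m in METHODS}
for row in ACCURACY.values():
    wins[METHODS[row.index(max(row))]] += 1

# Median b_m per method across the 13 data sets.
median_ratio = {m: median(r[i] for r in ratios.values())
                for i, m in enumerate(METHODS)}
```

By construction every ratio is at least one, and the best method on a data set has a ratio of exactly one there.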

Fig.  (6)  plots  the  distribution  of  bm  for  each  method  over 
the  13  data  sets.  The  box  area  represents  the  lower  and  upper 
quartiles  of  the  distribution  that  are  separated  by  the  median. 

Table 1. Classification Accuracy Rates in Subspaces Computed by the 8 Competing Methods Using 1NN Classifier, on the 13 Data Sets

Data Set      DLDA    PCA+LDA  newLDA  SDP-LDA  S-LDA   LFE     DNE     C-LDA
BreastCancer  0.6541  0.6445   0.6445  0.6536   0.6659  0.6386  0.6327  0.6609
Credit        0.7969  0.7488   0.5952  0.8062   0.8162  0.8088  0.8004  0.8013
Heart Cleve   0.7551  0.7530   0.7530  0.7644   0.7780  0.7581  0.7466  0.7695
Heart Hun     0.7607  0.7440   0.7077  0.7726   0.7679  0.7526  0.7641  0.7628
Ionosphere    0.7604  0.7514   0.7514  0.8421   0.8282  0.7379  0.7268  0.8364
Letters       0.9482  0.9379   0.9379  0.9692   0.9683  0.9530  0.9521  0.9686
Thyroid       0.8203  0.7936   0.7936  0.8221   0.8233  0.9634  0.9529  0.8267
Pima          0.6564  0.6831   0.6831  0.6824   0.6897  0.6728  0.6661  0.6883
Glass         0.9165  0.8588   0.8588  0.9059   0.9182  0.9082  0.9035  0.9153
Iris          0.8700  0.9362   0.9362  0.9437   0.9437  0.8912  0.8988  0.9450
Cancer Wis    0.9596  0.9559   0.9559  0.9577   0.9551  0.9557  0.9605  0.9550
Sonar         0.6774  0.5146   0.5030  0.7122   0.6896  0.6744  0.6659  0.6848
CatDog        0.6816  0.7152   0.5816  0.8025   0.7741  0.6962  0.6968  0.7741
Average       0.7890  0.7721   0.7463  0.8180   0.8168  0.8008  0.7975  0.8140

The outer vertical lines show the entire range of values for
the distribution. As shown in Fig. (6), the spread of the error
distribution for SDP-LDA is narrow and close to 1, followed
by S-LDA and C-LDA. The results on both the image and
binary data sets clearly demonstrate that SDP-LDA obtained
the most robust performance over these data sets.


Fig. (6). Error distributions of DLDA, PCA+LDA, newLDA,
SDP-LDA, S-LDA, LFE, DNE and C-LDA on the 13 data sets.

8.  SUMMARY 

This paper presents a margin based criterion for
dimensionality reduction that potentially provides a solution
to the small sample size problem, often associated with the
Fisher criterion. In particular, the paper has shown that (1)
the proposed criterion (15) for dimensionality reduction is
closely related to the average margin criterion; (2) the
criterion does not involve the inverse of Sw and can be
optimized using algorithms such as semi-definite
programming, thereby avoiding the small sample size
problem; and (3) an error bound is established for the
proposed technique. The paper demonstrates the efficacy of
the proposed technique using a number of real examples, and
the results show that the proposed technique registered
competitive performance against several competing methods
in several examples.

A.  Proof  of  Lemma  3 

Proof. We rewrite Eq. (19) as

$$\sum_{i=1}^{l} (y_i - w^t x_i)^2 + \gamma l\, w^t w = (y - X^t w)^t (y - X^t w) + \gamma l\, w^t w \qquad (29)$$

$$= l - 2 (l_1 m_1 - l_2 m_2)^t w + w^t (S_m + \gamma l I) w, \qquad (30)$$

where $X = (x_1 x_2 \cdots x_l)$ and $y = (y_1 y_2 \cdots y_l)^t$. Here we use the
fact that $Xy = l_1 m_1 - l_2 m_2$. Taking the derivative with respect
to $w$ and setting the result to 0, we have

$$(S_m + \gamma l I) w = (l_1 m_1 - l_2 m_2). \qquad (31)$$
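
The step from Eq. (29) to Eq. (31) can be checked numerically on a small example. The sketch below uses hypothetical two-dimensional points with labels +1 and -1, centered so that the overall mean is 0, and an arbitrary value of gamma; the helper `solve2` is written only for this illustration.

```python
def solve2(A, b):
    """Solve the 2x2 linear system A w = b with Cramer's rule."""
    (a, c), (d, e) = A
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


# Hypothetical toy data: three points labeled +1 and four labeled -1.
raw = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((4.0, 1.5), 1),
       ((-1.0, 0.0), -1), ((0.0, -1.0), -1), ((-2.0, -0.5), -1), ((-1.0, -2.0), -1)]
l = len(raw)
mean0 = [sum(x[k] for x, _ in raw) / l for k in range(2)]
data = [([x[0] - mean0[0], x[1] - mean0[1]], y) for x, y in raw]  # overall mean 0
gamma = 0.1  # illustrative regularization constant

# S_m = X X^t for centered data, and A = S_m + gamma * l * I.
S_m = [[sum(x[r] * x[c] for x, _ in data) for c in range(2)] for r in range(2)]
A = [[S_m[r][c] + (gamma * l if r == c else 0.0) for c in range(2)] for r in range(2)]

# Xy = sum_i y_i x_i, which equals l1 * m1 - l2 * m2.
Xy = [sum(y * x[k] for x, y in data) for k in range(2)]
w = solve2(A, Xy)  # solution of Eq. (31)


def objective(v):
    """Left-hand side of Eq. (29)."""
    fit = sum((y - (v[0] * x[0] + v[1] * x[1])) ** 2 for x, y in data)
    return fit + gamma * l * (v[0] ** 2 + v[1] ** 2)


def expanded(v):
    """Expanded form in Eq. (30)."""
    quad = sum(v[r] * A[r][c] * v[c] for r in range(2) for c in range(2))
    return l - 2 * (Xy[0] * v[0] + Xy[1] * v[1]) + quad
```

On this example the two forms agree at w, and moving w slightly in any coordinate direction increases the objective, as expected for the minimizer.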

We show that the three equations $(S_m + \gamma l I) w = (l_1 m_1 - l_2 m_2)$,
$(S_m + \gamma l I) w = m_1 - m_2$, and $(S_w + \gamma l I) w = m_1 - m_2$ have the
same solution $w$ up to a constant, given that the overall
mean is 0.

First we show that the two equations $(S_m + \gamma l I) w = m$ and
$(S_w + \gamma l I) w = m$ have the same solution (same set of
eigenvectors), where $m = m_1 - m_2$ is the mean difference of
the two classes.

Clearly solving $S_w w = m$ is equivalent to solving [2]

$$(S_w + \gamma l I)^{-1} S_b \Phi = \Phi \Lambda, \qquad (32)$$

where $\Phi$ and $\Lambda$ are the eigenvector and eigenvalue
matrices of $(S_w + \gamma l I)^{-1} S_b$. Since we have $S_w = S_m - S_b$,
following [2] (pp. 454), Eq. (32) can be written as

$$(S_m - S_b + \gamma l I) \Phi \Lambda = S_b \Phi$$

$$(S_m + \gamma l I) \Phi \Lambda = S_b \Phi (I + \Lambda)$$

$$(S_m + \gamma l I)^{-1} S_b \Phi = \Phi \Lambda (I + \Lambda)^{-1}. \qquad (33)$$

This shows that $\Phi$ is also the eigenvector matrix of
$(S_m + \gamma l I)^{-1} S_b$, and its eigenvalue matrix is $\Lambda (I + \Lambda)^{-1}$.

Without loss of generality, let us assume that the
components $\alpha_i$ of $\Lambda$ are such that $\alpha_1 > \cdots > \alpha_n$. It can be
shown that the corresponding components of $\Lambda (I + \Lambda)^{-1}$
preserve the same relationship

$$\frac{\alpha_1}{1 + \alpha_1} > \cdots > \frac{\alpha_n}{1 + \alpha_n}.$$

That is, the $t$ eigenvectors of $(S_m + \gamma l I)^{-1} S_b$ corresponding
to the $t$ largest eigenvalues are the same as the first $t$
eigenvectors of matrix $(S_w + \gamma l I)^{-1} S_b$. As a special case (two
class problem), the solutions resulting from $(S_m + \gamma l I) w = m$
and $(S_w + \gamma l I) w = m$ share the same "eigenvector".
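
For the two class case this relationship can be verified on a toy example. The sketch below assumes the usual scatter definitions (S_m about the overall mean, S_w about the class means, S_b = S_m - S_w) and uses hypothetical points and an arbitrary gamma. Because S_b has rank one for two classes, the single nonzero eigenvalue of each matrix equals its trace, which is what the code compares.

```python
def solve2(A, b):
    """Solve the 2x2 linear system A w = b with Cramer's rule."""
    (a, c), (d, e) = A
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def mean(points):
    return [sum(p[k] for p in points) / len(points) for k in range(2)]


def scatter(points, center):
    """Sum of outer products (p - center)(p - center)^t."""
    diffs = [(p[0] - center[0], p[1] - center[1]) for p in points]
    return [[sum(d[r] * d[c] for d in diffs) for c in range(2)] for r in range(2)]


def trace_of_inverse_times(A, B):
    """trace(A^{-1} B), computed one column of B at a time."""
    return sum(solve2(A, [B[0][k], B[1][k]])[k] for k in range(2))


# Hypothetical two-class data.
class1 = [(2.0, 1.0), (3.0, 2.0), (4.0, 1.5)]
class2 = [(-1.0, 0.0), (0.0, -1.0), (-2.0, -0.5), (-1.0, -2.0)]
l = len(class1) + len(class2)
gamma = 0.1  # illustrative regularization constant

m1, m2, m0 = mean(class1), mean(class2), mean(class1 + class2)
S_m = scatter(class1 + class2, m0)
S_w1, S_w2 = scatter(class1, m1), scatter(class2, m2)
S_w = [[S_w1[r][c] + S_w2[r][c] for c in range(2)] for r in range(2)]
S_b = [[S_m[r][c] - S_w[r][c] for c in range(2)] for r in range(2)]


def regularized(S):
    """Return S + gamma * l * I."""
    return [[S[r][c] + (gamma * l if r == c else 0.0) for c in range(2)]
            for r in range(2)]


# Nonzero eigenvalue of (S_w + gamma l I)^{-1} S_b and of (S_m + gamma l I)^{-1} S_b.
alpha = trace_of_inverse_times(regularized(S_w), S_b)
beta = trace_of_inverse_times(regularized(S_m), S_b)

# Solutions of the two linear systems with right-hand side m = m1 - m2.
m = [m1[0] - m2[0], m1[1] - m2[1]]
w_w = solve2(regularized(S_w), m)
w_m = solve2(regularized(S_m), m)
cross = w_w[0] * w_m[1] - w_w[1] * w_m[0]  # zero when the directions coincide
```

Here `beta` should match `alpha / (1 + alpha)` and `cross` should vanish up to rounding, in agreement with Eq. (33).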


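Now we show that $(S_m + \gamma l I) w = (l_1 m_1 - l_2 m_2)$ and
$(S_m + \gamma l I) w = m_1 - m_2$ produce the same solution as well.
Consider that the overall mean $m_0$ is 0. From
$l m_0 = l_1 m_1 + l_2 m_2 = 0$, we have $m_1 = \frac{l_2}{l} m$, $m_2 = -\frac{l_1}{l} m$, and
$l_1 m_1 - l_2 m_2 = \frac{2 l_1 l_2}{l} m$. Thus

$$(S_m + \gamma l I) w = (l_1 m_1 - l_2 m_2)$$

becomes

$$(S_m + \gamma l I) w = \frac{2 l_1 l_2}{l} m.$$

With constant $c = \frac{2 l_1 l_2}{l}$, the solution of $(S_m + \gamma l I) w = c m$
differs from the solution of $(S_m + \gamma l I) w = m$ only by the scale
factor $c$, so it is still in the same direction, and
thus is equivalent to solving $(S_w + \gamma l I)^{-1} S_b \Phi = \Phi \Lambda$.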
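
The identities relating the class means to the mean difference can be confirmed with a few lines of arithmetic. The points below are hypothetical, and the data are centered first so that the overall mean is 0, as the proof assumes.

```python
# Hypothetical two-class data with unequal class sizes.
class1 = [(2.0, 1.0), (3.0, 2.0), (4.0, 1.5)]
class2 = [(-1.0, 0.0), (0.0, -1.0), (-2.0, -0.5), (-1.0, -2.0)]
l1, l2 = len(class1), len(class2)
l = l1 + l2

# Center all points so that the overall mean m_0 is 0.
m0 = [sum(p[k] for p in class1 + class2) / l for k in range(2)]
class1 = [(p[0] - m0[0], p[1] - m0[1]) for p in class1]
class2 = [(p[0] - m0[0], p[1] - m0[1]) for p in class2]

m1 = [sum(p[k] for p in class1) / l1 for k in range(2)]
m2 = [sum(p[k] for p in class2) / l2 for k in range(2)]
m = [m1[k] - m2[k] for k in range(2)]  # mean difference

lhs = [l1 * m1[k] - l2 * m2[k] for k in range(2)]  # right-hand side of Eq. (31)
c = 2 * l1 * l2 / l                                # scale factor from the proof
```

With these values, `lhs` equals `c` times `m` componentwise, so the two right-hand sides differ only in scale.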